autodiscovery: advanced auto-config discovery via Python discover() bridge #50199

Closed

vitkyrka wants to merge 31 commits into main from vitkyrka/advanced-autoconfig-krakend

Conversation

@vitkyrka (Contributor) commented Apr 30, 2026

Summary

Generalises the krakend experiment into a reusable advanced-autoconfig path: rather than hard-coding an OpenMetrics prober and a %%discovered_port%% template variable in Go, the Agent now hands the probe decision to a Python discover(cls, service) classmethod on the integration's check class via a new rtloader bridge. The Python side decides what (if anything) to schedule and returns the fully-resolved instance configs back across the boundary.

This replaces the krakend-specific Go prober from the earlier revision of this PR with infrastructure any integration can opt into by:

  1. Shipping an auto_conf_discovery.yaml (ad_identifiers: + discovery: {} presence marker, plus an instances: template that the Python side may override).
  2. Adding a discover(cls, service) classmethod on its check class that returns list[dict] (or None on no match).
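Sketched concretely — the `Service` fields and the port check below are illustrative assumptions, not the actual datadog_checks_base API:

```python
# Hypothetical sketch of the discover() contract described above.
# Service fields (ip, ports) and the matching logic are assumptions.
from dataclasses import dataclass, field


@dataclass
class Service:
    ip: str
    ports: list = field(default_factory=list)


class MyCheck:
    @classmethod
    def discover(cls, service):
        """Return fully-resolved instance configs, or None on no match."""
        for port in service.ports:
            if port == 9090:  # stand-in for a real probe of the endpoint
                return [{"openmetrics_endpoint": f"http://{service.ip}:{port}/metrics"}]
        return None


# The Agent-side bridge would call this with the matched service.
print(MyCheck.discover(Service(ip="10.0.0.5", ports=[8080, 9090])))
# → [{'openmetrics_endpoint': 'http://10.0.0.5:9090/metrics'}]
```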

Tracks Confluence ticket DSCVR/6650004331.

Companion PR (Plan A helpers in datadog_checks_base.utils.discovery, the _run_discover Python bridge helper, and the krakend discover() migration): DataDog/integrations-core#23547 (branch vitkyrka/disco-autoconfig).

Implementation plan: docs/superpowers/plans/2026-05-06-discover-agent-bridge.md.

What's in this PR

New file format (auto_conf_discovery.yaml) — picked up by the existing file config provider (comp/core/autodiscovery/providers/config_reader.go). A non-nil discovery: block on integration.Config is the presence marker that this is a discovery template; the per-integration logic lives entirely on the Python side.
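A minimal sketch of what such a file might look like — the exact key set beyond `ad_identifiers`, `discovery`, and `instances` is an assumption here:

```yaml
# Hypothetical auto_conf_discovery.yaml shape; values are illustrative.
ad_identifiers:
  - krakend
discovery: {}    # non-nil block = presence marker for a discovery template
instances: []    # default template; the Python discover() may override it
```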

comp/core/autodiscovery/discoverer/ package — Go orchestration:

  • Discoverer / Bridge interfaces, decoupled from rtloader for testability.
  • defaultDiscoverer marshals the matched listeners.Service to JSON, calls the bridge, and converts the returned list-of-dicts into integration.Config values to schedule.
  • Cache keyed by (serviceID, integration_name); successes pinned, failures expire after 30s.
  • ErrPythonNotReady is treated as transient and not cached, so the next AD reconcile retries instead of sitting on a stale failure for the TTL.
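The cache policy above — successes pinned, failures expiring after a TTL, transient errors never stored — can be sketched as follows; class and key names are hypothetical, not the Go implementation:

```python
import time


class PythonNotReady(Exception):
    """Transient: the Python runtime is not initialised yet."""


class DiscoveryCache:
    """Successes are pinned; failures expire after ttl; transient errors are never stored."""

    def __init__(self, ttl=30.0):
        self.ttl = ttl
        self._entries = {}  # (service_id, integration) -> (result, is_success, stored_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        _result, ok, stored_at = entry
        if not ok and now - stored_at > self.ttl:
            del self._entries[key]  # failure expired: allow a retry
            return None
        return entry

    def run(self, key, discover_fn):
        cached = self.get(key)
        if cached is not None:
            return cached[0]
        try:
            result = discover_fn()
        except PythonNotReady:
            raise  # transient: do not cache, the next reconcile retries
        self._entries[key] = (result, result is not None, time.monotonic())
        return result
```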

rtloader run_discover bridge — new pure-virtual RtLoader::runDiscover, Three::runDiscover implementation (rtloader/three/three.cpp), C export in rtloader/rtloader/api.cpp, and the cgo wrapper pkg/collector/python/discover.go. The bridge calls datadog_checks.base.utils.discovery._run_discover(check_class, service_json) which builds a Service dataclass, invokes cls.discover(service), and returns the JSON-encoded result.

Lazy Python init from the bridge — mirrors the python check loader's existing pythonOnce.Do(InitPython) convention. Fixes the AD-vs-Python startup race for both the running agent and the agent check CLI subcommand without the rescan-on-ready plumbing that an earlier iteration of this branch carried (and that has since been reverted — see commits 7a95910 then 4c09170).

AD reconcile path — configmgr runs the discoverer before configresolver.Resolve whenever a template's Discovery field is set; on no-match the check is not scheduled (logged at DEBUG); on match the resolved instances are scheduled directly without going through any template-variable substitution.

Removed (vs. earlier revision of this PR) — the Go OpenMetrics prober, the serviceWithProbeResult wrapper, and the %%discovered_port%% template variable. The probe logic and any port hint handling now live in Python (krakend's discover() uses the http_probe + is_prometheus_exposition helpers from datadog_checks_base).
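The probe now lives in Python; the real http_probe / is_prometheus_exposition helpers ship in the companion integrations-core PR. A rough sketch of the exposition check, per the rule this PR's history describes (first non-comment line must parse as a Prometheus sample) — the regex is an assumption, not the real helper:

```python
import re

# Assumption: a sample line is a metric name, optional labels, then a value,
# e.g. `http_requests_total{code="200"} 42`.
_SAMPLE_RE = re.compile(r'^[a-zA-Z_:][a-zA-Z0-9_:]*(\{[^}]*\})?\s+\S+')


def looks_like_prometheus_exposition(body):
    """True if the first non-comment, non-blank line parses as a sample."""
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        return bool(_SAMPLE_RE.match(line))
    return False


sample = "# HELP up Target up\n# TYPE up gauge\nup 1\n"
print(looks_like_prometheus_exposition(sample))  # → True
```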

dev/e2e tooling

  • tasks/discovery_dev.py + test/dockerfiles/discovery-dev/ — dda inv discovery-dev.build-image produces an agent image with the dev tree bind-mounted, with a guard that fails fast when dda inv agent.build has re-linked rtloader against the host's libpython (the container ships Python 3.13).
  • docs/superpowers/2026-05-06-discover-e2e-smoke.md — manual smoke procedure (full build + bind-mount sequence) used to validate end-to-end against a real krakend container; intended as the basis for an automated harness.

Test plan

  • dda inv test --targets=./comp/core/autodiscovery/...,./pkg/collector/python — unit tests pass (discoverer with fake bridge, cache, providers, integration config).
  • dda inv linter.go — clean on touched packages.
  • bazel build //rtloader/... — C++ bridge builds; Three::runDiscover exercised through agent build.
  • End-to-end smoke against a real krakend:2.10 container per docs/superpowers/2026-05-06-discover-e2e-smoke.md: agent comes up, lazy-init triggers ~6s in, krakend check goes [OK] with 84 metrics/run sourcing http://<container-ip>:9090/metrics from the Python discover() result.
  • No-Python build path (e.g. cluster-agent): python_bridge_nopython.go stub keeps discoverer.New(nil) compiling and resolves discovery templates fail-closed.

Known limitation (carried forward)

The discoverer call still runs while the configManager mutex is held — serialises service reconciliation while Python is running. Acceptable for the experiment; should move outside the lock (or async) before broadening adoption.

🤖 Generated with Claude Code

dd-octo-sts (Bot) commented Apr 30, 2026

Go Package Import Differences

Baseline: 80e785f
Comparison: 15e6784

Every changed binary gains the same single import, `+github.com/DataDog/datadog-agent/comp/core/autodiscovery/discoverer`:

| binary | os | arch | change |
| --- | --- | --- | --- |
| agent | linux | amd64 | +1, -0 |
| agent | linux | arm64 | +1, -0 |
| agent | windows | amd64 | +1, -0 |
| agent | darwin | amd64 | +1, -0 |
| agent | darwin | arm64 | +1, -0 |
| agent | aix | ppc64 | +1, -0 |
| iot-agent | linux | amd64 | +1, -0 |
| iot-agent | linux | arm64 | +1, -0 |
| heroku-agent | linux | amd64 | +1, -0 |
| cluster-agent | linux | amd64 | +1, -0 |
| cluster-agent | linux | arm64 | +1, -0 |
| cluster-agent-cloudfoundry | linux | amd64 | +1, -0 |
| cluster-agent-cloudfoundry | linux | arm64 | +1, -0 |


dd-octo-sts (Bot) commented Apr 30, 2026

Files inventory check summary

File checks results against ancestor 80e785f4:

Results for datadog-agent_7.80.0~devel.git.470.15e6784.pipeline.111287371-1_amd64.deb:

No change detected

dd-octo-sts (Bot) commented Apr 30, 2026

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor 80e785f
📊 Static Quality Gates Dashboard
🔗 SQG Job

Successful checks

Info

Quality gate · Change · Size in MiB (prev → curr → max)
agent_deb_amd64 +38.45 KiB (0.01% increase) 740.927 → 740.965 → 750.310
agent_deb_amd64_fips +38.49 KiB (0.01% increase) 699.115 → 699.153 → 702.690
agent_heroku_amd64 +36.78 KiB (0.01% increase) 309.069 → 309.105 → 313.960
agent_msi +36.05 KiB (0.01% increase) 607.493 → 607.529 → 623.540
agent_rpm_amd64 +38.45 KiB (0.01% increase) 740.911 → 740.949 → 750.280
agent_rpm_amd64_fips +38.49 KiB (0.01% increase) 699.099 → 699.137 → 702.670
agent_rpm_arm64 +28.13 KiB (0.00% increase) 718.991 → 719.018 → 724.050
agent_rpm_arm64_fips +32.16 KiB (0.00% increase) 680.266 → 680.297 → 684.460
agent_suse_amd64 +38.45 KiB (0.01% increase) 740.911 → 740.949 → 750.280
agent_suse_amd64_fips +38.49 KiB (0.01% increase) 699.099 → 699.137 → 702.670
agent_suse_arm64 +28.13 KiB (0.00% increase) 718.991 → 719.018 → 724.050
agent_suse_arm64_fips +32.16 KiB (0.00% increase) 680.266 → 680.297 → 684.460
docker_agent_amd64 +40.11 KiB (0.00% increase) 801.303 → 801.342 → 805.870
docker_agent_arm64 +28.13 KiB (0.00% increase) 804.212 → 804.239 → 809.730
docker_agent_jmx_amd64 +40.12 KiB (0.00% increase) 992.223 → 992.262 → 996.590
docker_agent_jmx_arm64 +28.13 KiB (0.00% increase) 983.910 → 983.938 → 989.410
docker_cluster_agent_amd64 +28.03 KiB (0.01% increase) 206.583 → 206.610 → 207.600
docker_host_profiler_amd64 +3.19 KiB (0.00% increase) 301.103 → 301.106 → 315.800
docker_host_profiler_arm64 +3.41 KiB (0.00% increase) 312.616 → 312.619 → 327.400
iot_agent_deb_amd64 +28.03 KiB (0.06% increase) 44.454 → 44.482 → 44.970
iot_agent_deb_arm64 +20.03 KiB (0.05% increase) 41.439 → 41.458 → 42.560
iot_agent_deb_armhf +20.02 KiB (0.05% increase) 42.175 → 42.194 → 42.740
iot_agent_rpm_amd64 +28.03 KiB (0.06% increase) 44.455 → 44.482 → 44.970
iot_agent_suse_amd64 +28.03 KiB (0.06% increase) 44.455 → 44.482 → 44.970
9 successful checks with minimal change (< 2 KiB)
Quality gate Current Size
docker_cluster_agent_arm64 220.634 MiB
docker_cws_instrumentation_amd64 7.142 MiB
docker_cws_instrumentation_arm64 6.689 MiB
docker_dogstatsd_amd64 39.370 MiB
docker_dogstatsd_arm64 37.565 MiB
dogstatsd_deb_amd64 30.024 MiB
dogstatsd_deb_arm64 28.169 MiB
dogstatsd_rpm_amd64 30.024 MiB
dogstatsd_suse_amd64 30.024 MiB
On-wire sizes (compressed)
Quality gate · Change · Size in MiB (prev → curr → max)
agent_deb_amd64 +59.43 KiB (0.03% increase) 175.251 → 175.309 → 179.160
agent_deb_amd64_fips +43.51 KiB (0.03% increase) 166.983 → 167.026 → 174.440
agent_heroku_amd64 +10.45 KiB (0.01% increase) 74.952 → 74.963 → 80.310
agent_msi -16.0 KiB (0.01% reduction) 140.594 → 140.578 → 148.730
agent_rpm_amd64 +81.28 KiB (0.04% increase) 177.285 → 177.365 → 182.080
agent_rpm_amd64_fips +44.89 KiB (0.03% increase) 168.359 → 168.403 → 174.140
agent_rpm_arm64 +19.46 KiB (0.01% increase) 159.361 → 159.380 → 163.610
agent_rpm_arm64_fips +14.34 KiB (0.01% increase) 151.703 → 151.717 → 156.850
agent_suse_amd64 +81.28 KiB (0.04% increase) 177.285 → 177.365 → 182.080
agent_suse_amd64_fips +44.89 KiB (0.03% increase) 168.359 → 168.403 → 174.140
agent_suse_arm64 +19.46 KiB (0.01% increase) 159.361 → 159.380 → 163.610
agent_suse_arm64_fips +14.34 KiB (0.01% increase) 151.703 → 151.717 → 156.850
docker_agent_amd64 +27.76 KiB (0.01% increase) 267.680 → 267.707 → 272.990
docker_agent_arm64 +17.08 KiB (0.01% increase) 254.704 → 254.720 → 261.470
docker_agent_jmx_amd64 +50.99 KiB (0.01% increase) 336.309 → 336.358 → 341.610
docker_agent_jmx_arm64 -5.01 KiB (0.00% reduction) 319.361 → 319.357 → 326.050
docker_cluster_agent_amd64 neutral 72.413 MiB → 73.460
docker_cluster_agent_arm64 +8.87 KiB (0.01% increase) 67.866 → 67.875 → 68.680
docker_cws_instrumentation_amd64 neutral 2.999 MiB → 3.330
docker_cws_instrumentation_arm64 neutral 2.729 MiB → 3.090
docker_host_profiler_amd64 +15.6 KiB (0.01% increase) 110.742 → 110.757 → 125.600
docker_host_profiler_arm64 +8.1 KiB (0.01% increase) 105.070 → 105.078 → 120.000
docker_dogstatsd_amd64 neutral 15.238 MiB → 15.870
docker_dogstatsd_arm64 neutral 14.554 MiB → 14.890
dogstatsd_deb_amd64 neutral 7.941 MiB → 8.830
dogstatsd_deb_arm64 neutral 6.826 MiB → 7.750
dogstatsd_rpm_amd64 neutral 7.954 MiB → 8.840
dogstatsd_suse_amd64 neutral 7.954 MiB → 8.840
iot_agent_deb_amd64 +7.34 KiB (0.06% increase) 11.702 → 11.709 → 13.210
iot_agent_deb_arm64 +6.32 KiB (0.06% increase) 9.995 → 10.001 → 11.620
iot_agent_deb_armhf +4.33 KiB (0.04% increase) 10.204 → 10.208 → 11.780
iot_agent_rpm_amd64 +7.51 KiB (0.06% increase) 11.717 → 11.725 → 13.230
iot_agent_suse_amd64 +7.51 KiB (0.06% increase) 11.717 → 11.725 → 13.230

cit-pr-commenter-54b7da (Bot) commented Apr 30, 2026

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 51b861f0-7cbc-4885-80ce-9af0ac915eed

Baseline: 80e785f
Comparison: 15e6784
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf experiment goal Δ mean % Δ mean % CI trials links
docker_containers_cpu % cpu utilization -0.20 [-3.13, +2.73] 1 Logs

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI trials links
quality_gate_logs % cpu utilization +0.70 [-0.28, +1.68] 1 Logs bounds checks dashboard
tcp_syslog_to_blackhole ingress throughput +0.65 [+0.47, +0.83] 1 Logs
otlp_ingest_logs memory utilization +0.47 [+0.37, +0.57] 1 Logs
ddot_metrics_sum_cumulative memory utilization +0.27 [+0.11, +0.43] 1 Logs
ddot_metrics_sum_delta memory utilization +0.14 [-0.05, +0.33] 1 Logs
file_to_blackhole_0ms_latency egress throughput +0.03 [-0.50, +0.56] 1 Logs
docker_containers_memory memory utilization +0.02 [-0.08, +0.12] 1 Logs
file_to_blackhole_1000ms_latency egress throughput +0.01 [-0.41, +0.44] 1 Logs
file_to_blackhole_500ms_latency egress throughput +0.01 [-0.40, +0.41] 1 Logs
uds_dogstatsd_to_api ingress throughput -0.00 [-0.20, +0.19] 1 Logs
uds_dogstatsd_to_api_v3 ingress throughput -0.01 [-0.20, +0.19] 1 Logs
tcp_dd_logs_filter_exclude ingress throughput -0.01 [-0.10, +0.09] 1 Logs
file_to_blackhole_100ms_latency egress throughput -0.02 [-0.16, +0.11] 1 Logs
quality_gate_idle memory utilization -0.06 [-0.11, -0.02] 1 Logs bounds checks dashboard
ddot_metrics memory utilization -0.10 [-0.30, +0.09] 1 Logs
otlp_ingest_metrics memory utilization -0.10 [-0.27, +0.06] 1 Logs
ddot_metrics_sum_cumulativetodelta_exporter memory utilization -0.17 [-0.41, +0.06] 1 Logs
uds_dogstatsd_20mb_12k_contexts_20_senders memory utilization -0.18 [-0.23, -0.13] 1 Logs
docker_containers_cpu % cpu utilization -0.20 [-3.13, +2.73] 1 Logs
quality_gate_idle_all_features memory utilization -0.34 [-0.38, -0.30] 1 Logs bounds checks dashboard
ddot_logs memory utilization -0.76 [-0.82, -0.70] 1 Logs
file_tree memory utilization -0.78 [-0.83, -0.74] 1 Logs
quality_gate_metrics_logs memory utilization -1.23 [-1.47, -0.98] 1 Logs bounds checks dashboard

Bounds Checks: ✅ Passed

perf experiment bounds_check_name replicates_passed observed_value links
docker_containers_cpu simple_check_run 10/10 681 ≥ 26
docker_containers_memory memory_usage 10/10 244.51MiB ≤ 370MiB
docker_containers_memory simple_check_run 10/10 715 ≥ 26
file_to_blackhole_0ms_latency memory_usage 10/10 0.16GiB ≤ 1.20GiB
file_to_blackhole_0ms_latency missed_bytes 10/10 0B = 0B
file_to_blackhole_1000ms_latency memory_usage 10/10 0.21GiB ≤ 1.20GiB
file_to_blackhole_1000ms_latency missed_bytes 10/10 0B = 0B
file_to_blackhole_100ms_latency memory_usage 10/10 0.17GiB ≤ 1.20GiB
file_to_blackhole_100ms_latency missed_bytes 10/10 0B = 0B
file_to_blackhole_500ms_latency memory_usage 10/10 0.19GiB ≤ 1.20GiB
file_to_blackhole_500ms_latency missed_bytes 10/10 0B = 0B
quality_gate_idle intake_connections 10/10 3 ≤ 4 bounds checks dashboard
quality_gate_idle memory_usage 10/10 139.91MiB ≤ 147MiB bounds checks dashboard
quality_gate_idle_all_features intake_connections 10/10 3 ≤ 4 bounds checks dashboard
quality_gate_idle_all_features memory_usage 10/10 467.72MiB ≤ 495MiB bounds checks dashboard
quality_gate_logs intake_connections 10/10 4 ≤ 6 bounds checks dashboard
quality_gate_logs memory_usage 10/10 175.72MiB ≤ 195MiB bounds checks dashboard
quality_gate_logs missed_bytes 10/10 0B = 0B bounds checks dashboard
quality_gate_metrics_logs cpu_usage 10/10 349.22 ≤ 2000 bounds checks dashboard
quality_gate_metrics_logs intake_connections 10/10 3 ≤ 6 bounds checks dashboard
quality_gate_metrics_logs memory_usage 10/10 370.03MiB ≤ 430MiB bounds checks dashboard
quality_gate_metrics_logs missed_bytes 10/10 0B = 0B bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

CI Pass/Fail Decision

Passed. All Quality Gates passed.

  • quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.

@vitkyrka vitkyrka force-pushed the vitkyrka/advanced-autoconfig-krakend branch from 6f34723 to 15e6784 on May 4, 2026 16:05
@vitkyrka vitkyrka changed the title from "autodiscovery: declarative discovery probes (KrakenD experiment)" to "autodiscovery: advanced auto-config discovery via Python discover() bridge" on May 5, 2026
@vitkyrka (Contributor, Author) commented May 5, 2026

Reopening as a draft from the renamed branch vitkyrka/disco-autoconfig — see successor PR. The work is unchanged; GitHub doesn't allow changing the head branch of an existing PR.

@vitkyrka vitkyrka closed this May 5, 2026
vitkyrka and others added 14 commits May 6, 2026 04:05
For the advanced auto-config experiment. New optional field on
integration.Config, populated by the auto_conf_discovery.yaml provider
in a follow-up commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recognise the discovery: block in the file format and populate
integration.Config.Discovery. The file is picked up via the existing
.yaml extension matcher; only the configFormat struct gains a new
field and GetIntegrationConfigFromFile copies it into the returned
integration.Config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hints first (when exposed), then remaining exposed ports in declared
order. Dedup-aware.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-(serviceID, configHash) cache. Successes never expire;
failures expire after caller-supplied TTL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HTTP-GET each candidate port + path with a 500ms per-probe budget
and a 2s overall budget. Verify Content-Type is text/plain or
application/openmetrics-text and that the body's first non-comment
line is a Prometheus exposition line. Cache success/failure per
(serviceID, config hash).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tiny shim so %%discovered_port%% resolution can flow through the
existing GetExtraConfig path; no resolver signature change required.

Also tightens fakeService.GetExtraConfig in the prober tests to error
on unknown keys (matches the contract of real Service impls).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Routes via Resolvable.GetExtraConfig("discovered_port"). Populated by
autodiscovery/discovery's serviceWithProbeResult wrapper after a
successful probe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a Config has Discovery set, run the OpenMetrics prober against
the matched Service before configresolver.Resolve. On match wrap the
service so %%discovered_port%% resolves; on no match skip scheduling
the check (logged at DEBUG).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SubstituteTemplateEnvVars is called at config-load time with a nil
service. Without a nil check, GetDiscoveredPort panicked on
res.GetExtraConfig. Match the pattern used by GetPort/GetPid/
GetHostname: return a NoResolverError early when res is nil so the
caller can ignore it (config_reader.go:517 already does).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…plan

Cross-language plan (Go + C++ + Python) for the Agent-side infrastructure
that calls a Python discover() classmethod via rtloader, replacing the
existing krakend-experiment Go prober and %%discovered_port%% template var.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vitkyrka and others added 17 commits May 6, 2026 04:05
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
autoconfig.go calls discoverer.NewPythonBridge() unconditionally; without
this stub the symbol is undefined in builds where the python tag is absent
(e.g. cluster agent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Records the exact build + bind-mount sequence that successfully validates
the Plan B implementation against a real krakend container. Includes the
pitfalls hit during the manual run (Python ABI mismatch, RUNPATH/RPATH
bind mounts, conf.d vs data/ confusion, Python init race) so an automated
harness can avoid each one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous commit accidentally added "py" to ruff's exclude list to
work around a pre-commit hook failure on a transient local working-tree
directory. The directory is gone; revert the config change.
Surfaces ErrPythonNotReady from the Python bridge when rtloader has not
yet initialised, and skips the negative cache for that error so the next
AD reconcile event re-attempts the probe. Fixes a startup race where AD
reconciles before Python init completes (~30s gap), caches the failure,
and never re-probes in stable conditions — the krakend e2e smoke test
previously had to bounce the target container to clear the cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves the AD-vs-Python-init startup race for advanced auto-config
templates. Previously, AutoDiscovery's first reconcile fired before
rtloader.Initialize completed; the discoverer returned ErrPythonNotReady
(uncached after the previous fix) and no future event triggered a retry
in stable conditions, so the integration's check was never scheduled
without manually bouncing the target container.

- pkg/collector/python: signalPythonReady closes a once-channel at the
  end of Initialize; WaitReady blocks on it.
- discoverer.WaitForPython is the public entry point (with a no-op stub
  for builds without the python tag, so cluster-agent compiles cleanly).
- configmgr.rescanDiscoveryTemplates iterates active services with
  Discovery templates and re-runs reconcileService for each.
- AutoConfig.start launches a fire-and-forget goroutine that waits for
  Python to be ready and then runs the rescan. The bridge MUST NOT
  block on Python init in the AD reconcile path: fx hooks are
  sequential and that would deadlock against the very hook that
  triggers Initialize.

Verified end-to-end against the krakend tests/docker compose: krakend
check is now scheduled ~9 s after agent start without any manual
container bounce, sourcing http://<container-ip>:9090/metrics from the
Python discover() result.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the manual krakend-bounce step now that AutoConfig automatically
re-reconciles services with discovery templates once Python is ready.
Adds a note on the "skipped — python not yet ready" startup log being
expected and benign, plus the dev/lib rtloader restore step (needed
after every agent rebuild because cmake links against host Python 3.12).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`dda inv agent.build` re-links rtloader against the host's
python3.X-dev headers and overwrites the bazel-built .so files in
dev/lib/. The resulting agent fails inside the discovery-dev image
with `libpython3.12.so.1.0: cannot open shared object file` because
the container ships Python 3.13.

Detect this by extracting the libpython version the rtloader is
linked against and confirming the matching libpython exists in
dev/embedded/lib/ (where bazel installs it). Fail with the exact
remediation commands instead of letting the user discover the issue
inside the running agent container.
This reverts commit 7a95910. The rescan-on-Python-ready mechanism is
being replaced by an in-bridge lazy InitPython that mirrors the python
check loader's existing convention (loader.go: pythonOnce.Do(InitPython)
when python_lazy_loading is true). The lazy-init shape is simpler, also
fixes the CLI agent check subcommand (which hits the same race in a
fresh process), and removes ~111 lines of one-shot recovery plumbing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the python check loader convention (loader.go: pythonOnce.Do +
InitPython when python_lazy_loading is true). The discoverer is just
another consumer that needs Python; it runs init on demand if no
earlier consumer has done so.

This fixes the AD-vs-Python startup race for both the agent runtime
path AND the CLI 'agent check' subcommand. The previous
rescan-on-ready approach handled only the running-agent case (a fresh
process re-runs discovery from scratch and never gets a future event
to trigger the rescan).

The pythonOnce sync.Once shared with the loader makes init idempotent
across all callers. python_lazy_loading defaults to true; in eager
mode the collector still inits Python in its constructor and the
discoverer's check is a no-op.

Verified end-to-end against the krakend tests/docker compose: no
"skipped — python not yet ready" log, single straight-through
"Initializing rtloader" triggered by the discoverer ~6 s after agent
start, krakend check [OK] with 84 metrics/run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the "skipped — python not yet ready" log discussion and the
rescan-goroutine description in favour of the new straight-through
lazy-init path: the discoverer triggers InitPython via pythonOnce, and
the krakend check appears [OK] within ~10 s of agent start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

  • internal — Identify a non-fork PR
  • long review — PR is complex, plan time to review it
  • team/agent-devx
  • team/agent-log-pipelines
  • team/agent-runtimes
  • team/container-platform — The Container Platform Team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant